Introduction

Processing video data is problematic due to the high data 
rates involved. Television quality video requires approxi- 
mately 100 GBytes for each hour, or about 27 MBytes for 
each second. Such data sizes and rates severely stress stor- 
age systems and networks and make even the most trivial 
real-time processing impossible without special purpose 
hardware. Consequently, most video data is stored in a 
compressed format. 

While compression decreases storage and network costs, 
it increases processing cost because the data must be decom- 
pressed first. The overhead of decompression is enormous: 
today's sophisticated compression algorithms, such as JPEG 
or MPEG, require between 150 and 300 instructions per 
pixel for decompression [1]. This corresponds to a rate of 
2.7 billion instructions for each second of NTSC quality 
video processed. Furthermore, the data must often be com- 
pressed after processing, incurring additional overhead. 

†This research was supported by the National Science Foundation under
grants DCR-85-07256 and MIP-90-14940.



One way to circumvent these problems is to process the 
video data in compressed form. This technique reduces the 
amount of data that must be processed and avoids complex 
compression and decompression. Decreasing data volume has 
the side-effect of increasing data locality and more effectively 
using the processor cache, improving performance further. 

In a previous paper [2], we showed how to perform
scalar and pixel-wise addition or multiplication directly on
motion-JPEG video. In this paper, we extend this work to
show how a wider class of operations, where each pixel in 
the output image is a linear combination of several pixels in 
the input image, can be computed in the compressed 
domain. Doing so is challenging because such operations 
often cross JPEG block boundaries. We address this prob- 
lem by writing the operations as tensors to capture the 
block structure of the compressed image data. We then 
show how to express JPEG compression and decompres- 
sion as tensors, allowing the compressed domain equivalent 
of the image operator to be computed. 

Unfortunately, the resulting operation turns out to be no 
faster than spatial domain processing. Consequently, we
have developed an approximation technique called conden- 
sation which introduces a dead-zone in the compressed 
domain operator. Condensation dramatically reduces the 
cost of computing an image, but degrades its quality. This 
speed/quality trade-off is studied later. A prototype imple- 
mentation shows that this technique runs at rates 
approaching real-time on current generation computers. For 
example, a smoothing filter can be applied to a 320 × 240
JPEG-encoded image in about 75 ms on a DEC Alpha
workstation, which is approximately 12 frames/s.



Images, Image Processing, and JPEG as Tensors 

Image Representation and Manipulation 

A gray-scale image, f, is conventionally represented as a
matrix of pixels $f_{\alpha\beta}$, where $\alpha$ and $\beta$ specify the row and
column position of the pixel. Many operations on images
can be expressed using linear combinations of these pixels.
For instance, one way to express a smoothing operation is
this: a pixel in the output image, g, is half the value of the
corresponding pixel in the input image, f, plus one-eighth
the sum of the four neighboring pixels:

$$g_{\alpha\beta} = \frac{f_{\alpha-1,\beta} + f_{\alpha,\beta-1} + f_{\alpha+1,\beta} + f_{\alpha,\beta+1}}{8} + \frac{f_{\alpha\beta}}{2} \tag{1}$$

or, in the operation of shrinking an image by a factor of
two, each pixel in g is the average of four pixels in f:

$$g_{\alpha\beta} = \frac{f_{2\alpha,2\beta} + f_{2\alpha+1,2\beta} + f_{2\alpha,2\beta+1} + f_{2\alpha+1,2\beta+1}}{4} \tag{2}$$

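If the coefficients in these operations are gathered in a four-
dimensional matrix, T, the output image, g, is the product
of T and the input image f:

$$g_{\gamma\delta} = \sum_{\alpha\beta} T_{\alpha\beta\gamma\delta} f_{\alpha\beta}$$

For example, in Eqn. (1), T is

$$T_{\alpha\beta\gamma\delta} = \begin{cases} 1/2 & \text{if } \alpha = \gamma \text{ and } \beta = \delta \\ 1/8 & \text{if } \alpha = \gamma \text{ and } \beta = \delta \pm 1 \\ 1/8 & \text{if } \alpha = \gamma \pm 1 \text{ and } \beta = \delta \\ 0 & \text{otherwise} \end{cases}$$

and in Eqn. (2) T is

$$T_{\alpha\beta\gamma\delta} = \begin{cases} 1/4 & \text{if } \gamma = \lfloor \alpha/2 \rfloor \text{ and } \delta = \lfloor \beta/2 \rfloor \\ 0 & \text{otherwise} \end{cases}$$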

T is a fourth rank tensor (a four-dimensional matrix) that 
maps a second rank tensor (a two-dimensional matrix - the 
input image f) into another second rank tensor (the output 
image g). We make no assumptions about the structure of T, 
although in practice, T is very sparse, since in most image 
processing operations an output pixel depends on few input 
pixels. The tensor representation is quite flexible. Since each 
output pixel is a distinct linear combination of the input 
pixels, it can capture image processing operations not easily 
expressed in other formulations. For example, T can repre- 
sent an operator that blurs one part of an image and sharpens 
another, or an operator that performs different affine trans- 
formations on different parts of the image, as in morphing. 
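As a point of reference for the compressed domain versions developed later, Eqn. (2) can be written directly as a spatial domain loop. The following is a minimal sketch, assuming a row-major array of doubles whose dimensions are both even; the function name shrink_by_2 and its signature are illustrative and are not taken from the original implementation.

```c
#include <stddef.h>

/* Shrink a gray-scale image by two in each dimension, following Eqn. (2):
 * each output pixel is the average of a 2x2 group of input pixels.
 * src is rows x cols (both assumed even), stored row-major.
 * dst must have room for (rows/2) x (cols/2) values. */
void shrink_by_2(const double *src, size_t rows, size_t cols, double *dst)
{
    size_t out_rows = rows / 2, out_cols = cols / 2;

    for (size_t a = 0; a < out_rows; a++) {
        for (size_t b = 0; b < out_cols; b++) {
            double sum = src[(2 * a)     * cols + 2 * b]
                       + src[(2 * a + 1) * cols + 2 * b]
                       + src[(2 * a)     * cols + 2 * b + 1]
                       + src[(2 * a + 1) * cols + 2 * b + 1];
            dst[a * out_cols + b] = sum / 4.0;
        }
    }
}
```

Written this way, each output pixel touches exactly four input pixels, which is the sparsity of T that the text refers to.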

Suppose we want to apply T directly on a JPEG-encoded
image (JPEG is described in the next section). One problem
that arises is that T may cross block boundaries. To capture
this block structure, we represent f as a fourth rank tensor
$f_{xyij}$ with $\alpha = 8x + i$ and $\beta = 8y + j$. The first pair of
indices, x and y, specify the block address, and the second
pair, i and j, specify the pixel offset within the block, as
shown in Figure 1. Images that are represented this way are
called block-oriented images.
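The index correspondence can be sketched in a few lines of C. This is only an illustration: the BlockAddr type and the two helper names are hypothetical, and non-negative pixel coordinates are assumed.

```c
/* Block-oriented address of a pixel: block (x, y) and offset (i, j)
 * within that 8x8 block. Names are illustrative only. */
typedef struct {
    int x, y;   /* block address */
    int i, j;   /* pixel offset within the block, each in 0..7 */
} BlockAddr;

/* Split pixel coordinates (alpha, beta) so that
 * alpha = 8x + i and beta = 8y + j. Assumes alpha, beta >= 0. */
BlockAddr to_block_addr(int alpha, int beta)
{
    BlockAddr a = { alpha / 8, beta / 8, alpha % 8, beta % 8 };
    return a;
}

/* Recombine a block-oriented address into pixel coordinates. */
void from_block_addr(BlockAddr a, int *alpha, int *beta)
{
    *alpha = 8 * a.x + a.i;
    *beta  = 8 * a.y + a.j;
}
```

For instance, the pixel at row 19, column 5 lies in block (2, 0) at offset (3, 5).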

To capture block structure in tensor operators, we convert the fourth rank tensor, $T_{\alpha\beta\gamma\delta}$, into an eighth rank tensor, $T_{xyijwzuv}$, with the correspondence $\alpha = 8x + i$, $\beta = 8y + j$, $\gamma = 8w + u$, and $\delta = 8z + v$. This new tensor maps one block-oriented image to another

$$g_{wzuv} = \sum_{xyij} T_{xyijwzuv} f_{xyij}$$

which we can abbreviate g = Tf.

T is an eighth rank tensor that maps a fourth rank tensor
(the input image f, in block representation) into another
fourth rank tensor (the output image g). To understand T,
the idea of a block transform (BT) is introduced. A BT
maps one 8 × 8 block of pixels to another. To compute a
block in an output image, g, BTs are applied to each input
block in f and the transformed blocks are summed pixel-wise.
T is a four-dimensional array of BTs, with two
indices specifying the output block (w,z), and two specifying
the input block (x,y).

For example, consider scaling a 32 × 16 pixel image f to
a 16 × 8 image g (a shrink-by-2 operation), as shown in
Figure 2. To compute the left block in g, eight BTs are
applied to the blocks in f. The resulting blocks, shown in
the top of the figure, are added pixel-wise to produce the
output block. This strategy is repeated for the right block
in g.

Figure 1. Block-oriented pixel addressing.

Figure 2. Shrink-by-2 example.



JPEG compression as a tensor

Having shown how to express images and their operators as
tensors, we now turn to the task of expressing JPEG compression
and decompression as a tensor. To do so, we have
to rearrange the steps of the baseline JPEG algorithm
slightly. Briefly, JPEG divides an image into 8 × 8 blocks
and applies six steps to each block. Step one applies the
discrete cosine transform (DCT) to the block. Step two
orders the 64 DCT coefficients into a 64 element vector
using a zig-zag scan, a heuristic to place the low frequency
coefficients early in the vector. Step three scales the result
by dividing each coefficient by a constant. A different constant
is used for each coefficient. These constants are
usually arranged in a table, called the quantization table.
Step four rounds the result to the nearest integer. Step five
run length encodes the vector, and step six computes the
difference between the DC value of this block and the DC
value of the previous block (DPCM) and applies an entropy
coding technique (either Huffman or arithmetic) to the
result.



In most formulations, steps three and four are taken
together and called quantization. We split them apart so
that the first three steps can be combined into a linear operator,
J, since the DCT, zig-zag scanning, and scaling are all
linear operations.

With the first three steps combined, JPEG compression
is the four-step process shown in Figure 3. The first step
applies the linear transformation J to each 8 × 8 pixel
block f in the input image. The output is a 64 element vector
F. The second step rounds each element of F to the
nearest integer. The third step produces a sparse vector
representation of F (called the semi-compressed, or SC,
vector) using run length encoding, and the final step
applies DPCM to the DC component and entropy encodes
the SC-vector.

Figure 3. The JPEG compression/decompression process.

Decompression of a block, also depicted in Figure 3,
entropy decodes the JPEG bitstream, inverts the
DPCM to recover the SC-vector, and applies the linear
transformation $J^{-1}$ to recover an approximation of the
original pixel values.

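The JPEG-operator, J, is the composition of three linear
operations: (i) a discrete cosine transform (DCT), (ii) zig-zag
scanning, and (iii) scaling. If we write these steps as the
tensors D, Z, and S, then J is given by

$$J_{ijl} = \sum_{u,v,k} S_{kl} Z_{uvk} D_{ijuv} \tag{3}$$

D is a fourth rank tensor whose elements are

$$D_{ijuv} = \frac{1}{4} A(u) A(v) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \tag{4}$$

with

$$A(a) = \begin{cases} 1/\sqrt{2} & \text{if } a = 0 \\ 1 & \text{otherwise} \end{cases} \tag{5}$$

Z is a third rank tensor whose elements are all 0 or 1. It is
similar in spirit to a permutation matrix, since its function
is to rearrange data. Its elements are

$$Z_{uvk} = \begin{cases} 1 & \text{if } \mathrm{zigzag}[u,v] = k \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

and S is a diagonal, second rank tensor

$$S_{kl} = \begin{cases} 1/q[k] & \text{if } l = k \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where the vector q is derived from the JPEG quantization
table.

$J^{-1}$ is the inverse of J, and is similarly defined:

$$J^{-1}_{lij} = \sum_{u,v,k} D^{-1}_{ijuv} Z^{-1}_{uvk} S^{-1}_{kl} \tag{8}$$

where $D^{-1}$ is the IDCT

$$D^{-1}_{ijuv} = \frac{1}{4} A(u) A(v) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \tag{9}$$

$S^{-1}$ is a diagonal operator:

$$S^{-1}_{kl} = \begin{cases} q[k] & \text{if } l = k \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

and $Z^{-1}$ is the inverse zig-zag operator:

$$Z^{-1}_{uvk} = \begin{cases} 1 & \text{if } \mathrm{zigzag}[u,v] = k \\ 0 & \text{otherwise} \end{cases} \tag{11}$$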

J is a third rank tensor that maps a second rank tensor (an
8 × 8 pixel block) into a first rank tensor (a 64 element vector),
as shown in Figure 3. This mapping is expressed in
the equation $F_l = \sum_{i,j} J_{ijl} f_{ij}$. Similarly, $J^{-1}$ is a third rank
tensor that reverses this mapping, using $f_{ij} = \sum_l J^{-1}_{lij} F_l$.

The special structure of Z and S (and their inverses)
allows us to compute J and $J^{-1}$ efficiently. The C code in
Figure 4 shows a function, InitOperators, which computes
J and $J^{-1}$ and then stores the result in two
three-dimensional look-up tables, one for each operator.
    #include <math.h>

    #define PI 3.14159265358979323846
    #define SQRT2 1.4142135623730950488
    #define A(u) ((u) ? 0.5 : 0.5/SQRT2)

    /* Defined elsewhere (element types assumed here): the zig-zag
       permutation and the quantization table in zig-zag order. */
    extern int zzu[64], zzv[64];
    extern double quantTable[64];

    static double J[8][8][64], Jinv[64][8][8], C[8][8];

    void InitOperators(void)
    {
        int i, j, k, u, v;
        double tmp;

        /* One-dimensional DCT basis, including the scale factor A(u). */
        for (i=0; i<8; i++) {
            for (u=0; u<8; u++) {
                C[i][u] = A(u)*cos((2*i+1)*u*PI/16.0);
            }
        }

        /* Fold the DCT, zig-zag scan, and scaling into J and its inverse. */
        for (k=0; k<64; k++) {
            u = zzu[k];
            v = zzv[k];
            for (i=0; i<8; i++) {
                for (j=0; j<8; j++) {
                    tmp = C[i][u]*C[j][v];
                    J[i][j][k] = tmp/quantTable[k];
                    Jinv[k][i][j] = tmp*quantTable[k];
                }
            }
        }
    }

Figure 4. C code for computing J and its inverse $J^{-1}$.

The code uses three externally defined arrays, zzu, zzv, and
quantTable, which encode the permutation specified by zig-zag
ordering and the quantization table, respectively.

Compressed Domain Processing 

Having formulated images, processing, JPEG compression, 
and JPEG decompression as tensors, these steps are easily 
combined. Consider processing a JPEG compressed gray- 
scale image. With the processing specified by the tensor T, 
the following steps, illustrated in Figure 5, are needed: 



(i) Decompress the input bitstream to form the SC image.
The SC image is a two-dimensional array of first rank
tensors called SC-vectors. The SC-vectors are denoted
$F_{xy}$ or $H_{wz}$, for the input or output image, respectively,
and may be sparsely encoded. Each SC-vector corresponds
to an 8 × 8 pixel block in the decompressed
image. The subscripts specify the block offset.

(ii) Convert each SC-vector to an 8 × 8 block of pixels by
applying $J^{-1}$: $f_{xy} = J^{-1} F_{xy}$.

(iii) Compute each 8 × 8 pixel block in the output image using
the block transforms $T_{wzxy}$: $h_{wz} = \sum_{xy} T_{wzxy} f_{xy}$.

(iv) Convert the output image to the SC representation using
$H_{wz} = J h_{wz}$.

(v) Round, run-length, DPCM, and entropy encode $H_{wz}$.
Steps (ii), (iii), and (iv) can be combined:

$$H_{wz} = J \sum_{xy} T_{wzxy} \left( J^{-1} F_{xy} \right) = \sum_{xy} \left( J T_{wzxy} J^{-1} \right) F_{xy} \tag{12}$$

The term in parentheses is the compressed domain equivalent
of the block transform $T_{wzxy}$:

$$\tau_{wzxy} = J T_{wzxy} J^{-1} \tag{13}$$

$\tau_{wzxy}$ is a block transform that computes an SC-vector in
the output directly from the SC-vectors in the input. Using
$\tau_{wzxy}$, images can be processed in the compressed domain,
as shown in Figure 6. The steps are:

(i) Decompress the input bitstream to form the SC image.

(ii) Compute each output SC-vector using the block transforms
$\tau_{wzxy}$: $H_{wz} = \sum_{xy} \tau_{wzxy} F_{xy}$.

(iii) Round, run-length, DPCM, and entropy encode $H_{wz}$.

Figure 5. Graphical depiction of spatial domain processing.

Figure 6. Graphical depiction of compressed domain processing.

Each SC-vector in the output ($H_{wz}$) is computed by multiplying
each SC-vector in the input ($F_{xy}$) by its
corresponding block transform ($\tau_{wzxy}$) and accumulating
the results. Since SC-vectors are first rank tensors (i.e., vectors,
or one-dimensional arrays), and block transforms,
$\tau_{wzxy}$, are second rank tensors (i.e., matrices, or two-dimensional
arrays), the structure of the calculation is a sum of
matrix/vector multiplies.

Returning to the example of shrinking a 32 × 16 image
(4 × 2 blocks) to a 16 × 8 image (2 × 1 blocks), to compute
the left SC-vector, $H_{00}$ (Figure 7), the SC-vectors $F_{00}$,
$F_{01}$, $F_{02}$, $F_{03}$, $F_{10}$, $F_{11}$, $F_{12}$, and $F_{13}$ are multiplied by the
block transforms $\tau_{0000}$, $\tau_{0001}$, $\tau_{0002}$, $\tau_{0003}$, $\tau_{0010}$, $\tau_{0011}$, $\tau_{0012}$,
and $\tau_{0013}$, respectively:

$$H_{00} = \tau_{0000} F_{00} + \tau_{0001} F_{01} + \tau_{0002} F_{02} + \tau_{0003} F_{03} + \tau_{0010} F_{10} + \tau_{0011} F_{11} + \tau_{0012} F_{12} + \tau_{0013} F_{13}$$

Since, in shrinking by 2, the rightmost four blocks in F do
not affect $H_{00}$, the block transforms $\tau_{0002}$, $\tau_{0003}$, $\tau_{0012}$, and
$\tau_{0013}$ are zero and can be ignored for efficiency. The other
four block transforms, $\tau_{0000}$, $\tau_{0001}$, $\tau_{0010}$, and $\tau_{0011}$, are 64 by
64 matrices. Figure 8 shows what these matrices look like.
The first 16 rows and columns of $\tau_{0000}$ are shown in the top of
the figure. A scatter plot showing the positions of the non-zero
elements of this matrix is shown in the bottom of Figure 8.

This example illustrates an important property of $\tau$: most
of the $\tau_{wzxy}$ are zero. This property allows the computation
of output images to be performed efficiently. It holds for
many compressed domain image operators, since in most
image processing a pixel in the output image is a function
of only a few pixels in the input.

Despite the sparseness of $\tau$, the compressed domain operation
is still slow. To see why, recall that the output
SC-vector is the sum of a sequence of matrix/SC-vector
multiplies. Each SC-vector has 64 elements, and each of the
multiplying matrices is 64 × 64, so 4K multiply/add operations
are required for each matrix/SC-vector multiply.
With several such terms, the operation count gets large.
For example, shrink-by-2 requires four matrix/SC-vector
multiplies per output block, so 16K multiply/add operations
are required for each output block. Thus, an average of
16K/64 = 256 multiplies is required per pixel (since each
SC-vector represents 64 pixels), which is more expensive
than the spatial domain operation.

Since the SC-vectors are stored in a run length encoded
format and are typically sparse, sparse matrix techniques can
be used to reduce the number of multiplies. But the computation
is still too expensive to perform in real-time on
general purpose workstations. The next section develops an
approximation technique that reduces this cost to a few
multiplies per pixel.



Condensation 

This section describes a technique, called condensation,
that approximates compressed domain operators so they
can be efficiently computed. Condensation modifies the
operator $\tau$ to produce a new operator $\tau'$ such that $\tau'$ is
sparse and, when $\tau'$ is used to compute an effect, the result
will be nearly identical to that computed using $\tau$. In other
words, if $H = \tau F$ and $H' = \tau' F$, then $H \approx H'$.

Figure 7. Shrink-by-2 filtering in the compressed domain.

Since $\tau'$ is sparse and the input vectors F are sparse, the
resulting computation can be implemented efficiently. Two
properties, one of $\tau$ and one of the input vectors, make
condensation possible.

(i) Most elements of $\tau$ have small absolute values [1]. For
example, 90% of the elements in shrink-by-2 have an
absolute value less than 0.05. This property can be seen
by examining Figure 8.
(ii) The input vectors F are sparse, and non-zero values are 
typically small integers. This property is expected, 
since the DCT concentrates the energy of the image 
into a few coefficients. Furthermore, JPEG quantizes 
high frequency components aggressively, leading to the 
small absolute values of these elements. 

These two properties allow us to approximate $\tau$ as a
sparse tensor as follows. An element in an output SC-vector
is a linear combination of elements in a set of input
SC-vectors. The elements themselves are small integers,
and the coefficients of this linear combination are stored in
$\tau$. Small elements of $\tau$, called insignificant elements, will
have little effect on the value of this sum, since they will be
multiplied by small integers, and the result will be rounded
off anyway in the next step (Figure 6). In other words, why
go to the trouble of computing the output to several decimal
places if you are going to throw away the fraction anyway?

We can exploit this observation by setting insignificant
elements in $\tau$ to zero. Doing so will reduce the number of
operations required to compute H, but at the price of a
small error in the output. Such errors are likely to go
undetected because JPEG compression introduces the same
type of loss. Since the majority of the elements of $\tau$ are
insignificant (property (i)), this optimization should save a
large number of arithmetic operations. We call this process
of setting elements of $\tau$ to zero condensation. In effect,
condensation introduces a dead zone into the operator:
elements in the operator below a threshold are set to zero.
The question is: what elements of $\tau$ can we safely set to
zero?

To answer this question, let us formulate the concept of
condensation more precisely. Recall that the value of an
output vector H is computed as a sum of matrix/SC-vector
multiplies:

$$H_{wz} = \sum_{xy} \tau_{wzxy} F_{xy} \tag{14}$$

Let N be the number of matrix/SC-vector multiplies in this
sum (e.g., N = 4 for shrink-by-2). If we use the condensed
operator $\tau'$ instead of $\tau$, the error in $H_{wz}$ is

$$\Delta H_{wz} = \sum_{xy} \tau_{wzxy} F_{xy} - \sum_{xy} \tau'_{wzxy} F_{xy} = \sum_{xy} \Delta\tau_{wzxy} F_{xy} \tag{15}$$

where $\Delta\tau$ is the tensor composed of insignificant elements
of $\tau$. The worst case error occurs when all elements in F are
at their maximum value. In thresholding condensation, the
tensors are condensed by guaranteeing that no element in
$\Delta H_{wz}$ will ever exceed a parameter maxerr in a worst-case
scenario. If we let $max_k$ denote the worst-case (i.e., the
largest) value of the kth element of the SC-vector, the
heuristic to zero an element of $\tau$ is

$$\left| \tau_{wzxykl} \right| < \frac{maxerr}{64 \times N \times max_k} \tag{16}$$

$max_k$ can be chosen statistically using data gathered from a
large set of images [1]. These ideas lead directly to the following
algorithm.

Figure 8. First 16 rows and columns of $\tau_{0000}$ for shrink-by-2 operation, and structure of its non-zero elements.
Algorithm (Thresholding Condensation)

1. max: array [0..63] of integer;
2. N := number of transform tensors
3. for each tensor τ
4.   for k := 0 to 63 do begin
5.     threshold := maxerr/(64*N*max[k]);
6.     for l := 0 to 63 do
7.       if (abs(τ[l][k]) <= threshold) then
8.         τ[l][k] := 0.0;
9.   end

In this code, the array τ is a block transform tensor and
max stores the largest expected value of an AC component
of any SC-vector (see appendix A). In lines 7-8, any
insignificant element, as specified by Eqn (16), is set to
zero.
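The thresholding step can be sketched in C for a single block transform. This is a hedged illustration, not the authors' code: the function name condense_threshold, the dense 64 × 64 layout, and the double-valued max_k array are assumptions, and the comparison is made on the magnitude of each element, which is how Eqn (16) is read here.

```c
#include <math.h>

/* Zero the insignificant elements of one 64x64 block transform.
 * tau[l][k] multiplies element k of an input SC-vector and contributes
 * to element l of the output SC-vector.
 * max_k[k] is the largest magnitude expected for input element k
 * (assumed > 0), n is the number of block transforms summed per output
 * SC-vector, and maxerr is the worst-case error budget.
 * Returns the number of elements that were set to zero. */
int condense_threshold(double tau[64][64], const double max_k[64],
                       int n, double maxerr)
{
    int zeroed = 0;

    for (int k = 0; k < 64; k++) {
        double threshold = maxerr / (64.0 * n * max_k[k]);
        for (int l = 0; l < 64; l++) {
            if (tau[l][k] != 0.0 && fabs(tau[l][k]) <= threshold) {
                tau[l][k] = 0.0;
                zeroed++;
            }
        }
    }
    return zeroed;
}
```

With maxerr = 64, n = 4, and every max_k equal to 10, the threshold is 64/(64 × 4 × 10) = 0.025, so elements such as 0.001 or -0.002 are dropped while 0.5 and -0.9 are kept.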

The threshold is set so that the error in the output is
bounded by maxerr. Unfortunately, when large values of
maxerr are used, the block transform matrices, $\tau_{wzxy}$, cannot
be condensed independently. To see why, suppose you
had an operator with N = 3 (i.e., the output SC-vector is a
linear combination of three input SC-vectors), and the
value of the first AC component of the output SC-vector is
given by

$$H_1 = 2A_0 - B_0 - C_0$$

where $A_0$, $B_0$, and $C_0$ are the DC components of the three
input SC-vectors. If $A_0 = B_0 = C_0$, the terms cancel and
$H_1$ is zero. Now suppose condensation uses a threshold
such that the two terms with $B_0$ and $C_0$ drop out. Then the
new value for $H_1$ is $H_1 = 2A_0$. Since $A_0$ is the DC component,
it can be large, and in such a case the output
component $H_1$ will be large and result in highly visible artifacts
in the output. Figure 9 shows a solid gray image
filtered with a Gaussian blur filter where each compressed
domain tensor in the filter was condensed independently
using thresholding condensation. The pattern of dots is an
artifact caused by setting AC components of the output
SC-vector, such as $H_1$, to relatively large values (the output
should be uniform gray).

This problem can be solved by introducing the concept
of tensor bias. The tensor bias of $\tau$ is defined as

$$b_{wz}(k,l) = \sum_{xy} \tau_{wzxykl} \tag{17}$$

Intuitively, tensor bias is a measure of how much the cross-block
terms in Eqn (14) tend to cancel each other out when
the tensor is used to compute the output block at w, z. In the
example above, $b_{wz}(0,1)$ is 2 - 1 - 1 = 0 before condensation,
and $b_{wz}(0,1)$ is 2 after condensation. The change in
tensor bias means that terms that cancelled each other out
before condensation do not do so afterwards, resulting in
artifacts such as those in Figure 9.

Figure 9. Gray image filtered with blur without constant bias.

We can remove these artifacts by adding the constant
bias constraint to condensation: the tensor bias after condensation
should be the same as before. To implement the
constant bias constraint, we calculate and store the bias of
the tensor before applying a condensation algorithm, condense
the tensor, and then adjust the remaining non-zero
elements to restore the bias to its previous value. More precisely,
we adjust the elements in the tensors by distributing
the change in bias $\delta b$ equally among the non-zero elements
remaining in the tensor after condensation. If no elements
remain, a randomly selected tensor absorbs the change in
bias. $\delta b$ is given by

$$\delta b_{wzkl} = b_{wz}(k,l) - b'_{wz}(k,l) \tag{18}$$
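A possible reading of the constant bias adjustment, for the block transforms that contribute to one output block, is sketched below. Several details are assumptions made for illustration: the function name restore_bias, the dense array layout, the interpretation that the change in bias at position (k, l) is shared among the surviving elements at that same position, and the choice of the first block transform (instead of a randomly selected one) when no element survives.

```c
/* Restore the tensor bias of the n block transforms that feed one
 * output block (w, z).
 * orig holds the transforms before condensation and cond the
 * transforms after condensation; cond is adjusted in place.
 * Element [m][l][k] belongs to transform m, output element l,
 * input element k. */
void restore_bias(int n, double orig[][64][64], double cond[][64][64])
{
    for (int l = 0; l < 64; l++) {
        for (int k = 0; k < 64; k++) {
            double delta = 0.0;   /* change in bias at this position */
            int remaining = 0;    /* surviving non-zero elements */

            for (int m = 0; m < n; m++) {
                delta += orig[m][l][k] - cond[m][l][k];
                if (cond[m][l][k] != 0.0)
                    remaining++;
            }
            if (delta == 0.0)
                continue;

            if (remaining == 0) {
                /* Simplification: the text selects a tensor at random;
                 * this sketch always uses the first one. */
                cond[0][l][k] += delta;
            } else {
                for (int m = 0; m < n; m++)
                    if (cond[m][l][k] != 0.0)
                        cond[m][l][k] += delta / remaining;
            }
        }
    }
}
```

Applied to the three-term example above, the coefficients (2, -1, -1) condensed to (2, 0, 0) have a change in bias of -2, which the single surviving element absorbs, so the bias returns to zero and a uniform input again produces a zero AC component.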



Implementation and Experimental Results 

This section describes a set of experiments we performed to 
evaluate the effectiveness of compressed domain process- 
ing using condensation. The experiments characterize both 
the performance of the technique and the quality of the 
images computed using condensed operators. We first 
describe the implementation and then report the perfor- 
mance results. 

Our implementation is divided into two phases. In phase
one, the compressed domain tensor, $\tau$, is computed, condensed,
and stored in a file. In the second phase, $\tau$ is read
from a file, the JPEG stream is entropy decoded to recover
the SC-image, $\tau$ is applied to the SC-vectors to compute the
output SC-image, and the result is encoded as a JPEG bitstream.
Phase one is executed off-line, whereas phase two
operates in real-time. Since we are not concerned with the
speed of off-line processing, our implementation is optimized
for phase two. In practice, phase one takes a few
seconds on a typical workstation.

To make phase two efficient, we must develop a data
structure for efficiently calculating the output SC-image
from the input SC-image. This calculation can be written:

(i) for all w,z in the output image
(ii)   zero the output SC-vector $H_{wz}$
(iii)  for all x,y in the input image such that $\tau_{wzxy}$ is not all zero
(iv)     compute $\tau_{wzxy} F_{xy}$ and add the result to $H_{wz}$

An efficient data structure will exploit the sparseness of the
operators and the SC-vectors, and the redundancy in the
block transforms. We used the data structures diagrammed
in Figure 10. The SC-vectors $F_{xy}$ are stored in the
SparseVector data type. Each SparseVector has a field
indicating the size of the array and an array of (index,
value) pairs, which indicate the position and value of the
SC-vector's non-zero elements. Each matrix $\tau_{wzxy}$ is stored
in a SparseMatrix data structure, consisting of an array of
64 pointers to SparseVectors indexed by k. The
SparseVectors in a SparseMatrix correspond to the columns
of a block transform $\tau_{wzxy}$. In this usage, index specifies the row
index l and value contains $\tau_{kl}$. In our implementation, each
unique SparseMatrix is stored in a table called the
SparseMatrixTable, and offsets in this table are used to reference
a particular SparseMatrix. The set of block transforms
needed to compute an output SC-vector is stored in a linked
list of SparseMatrixRefs. A SparseMatrixRef is a tuple (x, y,
num) where x and y indicate the block coordinates of the
source SparseVector $F_{xy}$ used in step (iv), and num indicates
the offset of the matrix $\tau_{wzxy}$ in the SparseMatrixTable. The
entire compressed domain tensor is stored in the
TransformTable, a two-dimensional array of such lists.

Using these structures, the calculation can be written:

(i)   for all w,z in the output image
(ii)    zero the output SC-vector H;
(iii)   SparseMatrixRef = TransformTable[w, z];
(iv)    while (SparseMatrixRef != NULL)
(v)       SparseMatrix = SparseMatrixTable[SparseMatrixRef->num]
(vi)      ApplyBlockTransform(F[SparseMatrixRef->x, SparseMatrixRef->y], SparseMatrix, H);
(vii)     SparseMatrixRef = SparseMatrixRef->next;

ApplyBlockTransform multiplies an SC-vector (its first 
parameter) by a sparsely encoded block transform (its sec- 
ond parameter) and accumulates the result in its third 
parameter. 
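The data structures and the inner loop can be sketched in C as follows. The type names follow the text, but the field names, the SparseEntry helper type, and the use of a dense 64-element accumulator for H are assumptions made for this illustration and are not taken from the original implementation.

```c
#include <stddef.h>

/* One (index, value) pair of a sparsely encoded vector. */
typedef struct {
    int index;
    double value;
} SparseEntry;

typedef struct {
    int size;               /* number of (index, value) pairs */
    SparseEntry *entries;
} SparseVector;

typedef struct {
    SparseVector *cols[64]; /* column k of the block transform; NULL if empty */
} SparseMatrix;

typedef struct SparseMatrixRef {
    int x, y;               /* block coordinates of the source SC-vector */
    int num;                /* offset into the SparseMatrixTable */
    struct SparseMatrixRef *next;
} SparseMatrixRef;

/* Accumulate tau * F into H. Only non-zero input elements and non-zero
 * matrix entries are visited, so the cost is proportional to the number
 * of non-zero products rather than to 64 * 64. */
void ApplyBlockTransform(const SparseVector *F, const SparseMatrix *tau,
                         double H[64])
{
    for (int a = 0; a < F->size; a++) {
        int k = F->entries[a].index;          /* input element position */
        double fk = F->entries[a].value;
        const SparseVector *col = tau->cols[k];

        if (col == NULL)
            continue;
        for (int b = 0; b < col->size; b++)
            H[col->entries[b].index] += col->entries[b].value * fk;
    }
}
```

Storing the matrix by columns means that each non-zero element of the input SC-vector selects exactly one column to scan, which is what keeps the inner loop short when both the operator and the input are sparse.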



Figure 10. Data structures used in the implementation (SparseMatrixRef, SparseVector, and SparseMatrix).

The advantage of this representation is that it exploits the
sparseness of $\tau$, it enables the inner loop of the compressed
domain processing algorithm to be efficiently implemented,
and it allows matrices to be shared. This last property
allows operators with repeated tensors, such as convolutions,
to be stored efficiently.



Experimental Results 

Having sketched the data structures used in the implementation,
we now answer the question 'how well does it
work?' This question gives rise to two questions: (i) how
does the maxerr parameter of thresholding condensation
affect the quality of the output image and (ii) how fast can
an operation be computed on a current generation workstation?
We will answer these questions in turn.

The maxerr parameter in thresholding condensation
affects the time needed to compute the result and the quality
of the result. Experiments showed maxerr to be a
non-intuitive measure. For example, with maxerr = 2000,
reasonable quality images were produced. This is because
thresholding condensation uses worst-case values for the
input blocks, which rarely occur. Further investigation
showed that the average number of multiplies needed to calculate
an output vector, which is a function of maxerr,
provided a more meaningful measure of condensation than
maxerr.

We evaluated the distortion introduced by thresholding 
condensation for two operations: the blur operation, which 
convolves the image with a 7 X 7 Gaussian filter, and the 
shrink-by-2 operation, which shrinks an image by a factor 
of two along each dimension. In the experiment, 12 con- 
densed operators, corresponding to 12 different values of 
maxerr, were created for each test operator using threshold- 
ing condensation. We applied the resulting operators to 98 
randomly selected gray-scale images and measured two
values: the average number of multiplies required to calculate
an output vector and the signal-to-noise ratio (SNR) of
the resulting images. SNR is defined as

$$\mathrm{SNR} = 10 \log \frac{\mathrm{rms}(O)}{\mathrm{rms}(C - O)}$$

where rms(·) is the root mean squared over the image, C is
the image calculated using the condensed tensor, and O is
the image using the original (uncondensed) tensor.

Figure 11 shows the effect of condensation on shrink-by-2.
The top image was computed with the uncondensed operator
(1100 mults/vector), and the bottom image was
computed with the condensed operator at SNR = 33 (about
330 mults/vector). Figure 12 shows the effect of condensation
on blur. The top image was computed with the
uncondensed operator (about 5200 multiplies for each output
vector), and the bottom image was computed with the
condensed operator at SNR = 30 (about 90 multiplies for
each output vector). These figures are intended to give the
reader an intuitive feeling for the artifacts introduced by
condensation and the relationship between SNR and image
quality. Subjective evaluation by the authors indicates that at
an SNR of about 30, the output quality is quite good, and at
an SNR above 35, the output image is essentially identical
to the image computed using the uncondensed operator. At
SNR values less than 30, the quality of the image degrades
rapidly. Figure 13 shows a graph of the mean SNR for blur
and shrink-by-2 as a function of the number of multiplies.

Figure 11. Effect of condensation for shrink-by-2. Top: uncondensed
(1100 mults/vector). Bottom: condensed at SNR ~ 30
(330 mults/vector).

Table 1 compares the performance of the image space
method with our implementation using shrink-by-2 and
blur at various levels of condensation. The experiments
were performed on the same test suite of 98 images used
for the distortion experiments. The tests used a prototype
implementation on a DEC 3000/400 workstation with 64 MBytes of
memory. The table shows that, with reasonable quality
output, shrink-by-2 can be applied to a 640 × 480 image
at about 5 frames/s (fps) on our test workstation, and blur
can be applied to the same image at about 3 fps. Shrink-by-2
is faster than blur because the image produced by
shrink-by-2 is smaller than the image produced by blur,
which speeds image encoding. The results scale approximately
linearly with image area, so shrink-by-2 and blur
can be applied to 320 × 240 images at about 20 fps and
12 fps, respectively.

Figure 12. Effect of condensation for blur. Top: uncondensed
(5270 mults/vector). Bottom: condensed at SNR = 33 (90
mults/vector).

Tables 2 and 3 show the results of profiling our imple- 
mentation using the two test operators. The tables divide the 
total execution time into four phases: Huffman Decoding, 
Huffman Encoding, Operator Application, and Overhead. 
Huffman Decoding is the time spent reading and decoding 
the JPEG file into SC-vectors. Huffman Encoding is the 
time spent encoding the output, including quantization, run 
length coding, DPCM, and bitstream generation. Operator 
Application is the time spent computing the product of the 
condensed operator and input SC-vectors, and Overhead is 




[Figure 13 plots: SNR on the vertical axis against the number 
of multiplies (mults) on the horizontal axis. Top panel: blur. 
Bottom panel: shrink-by-2.] 

Figure 13. SNR of blur (top) and shrink-by-2 (bottom) vs. num- 
ber of multiplies. 

Table 1. Speed of the blur and shrink-by-2 operation. 

Operator       Test conditions    Time (s)    Speedup 

Blur           SNR = 25           0.290       43.4 
Blur           SNR = 30           0.331       38.0 
Blur           Not condensed      4.45        2.83 
Blur           Image space        12.6        1 
Shrink-by-2    SNR = 25           0.141       5.36 
Shrink-by-2    SNR = 30           0.202       3.74 
Shrink-by-2    Not condensed      0.328       2.30 
Shrink-by-2    Image space        0.755       1 



Table 2. Breakdown of time in computing the blur operation. 

Test conditions    Huffman     Huffman     Operator       Overhead 
                   Decoding    Encoding    application 

SNR = 25           16%         21%         39%            24% 
SNR = 30           13%         19%         45%            23% 
Not condensed      1%          2%          95%            2% 



Table 3. Breakdown of time in computing the shrink-by-2 operation. 

Test conditions    Huffman     Huffman     Operator       Overhead 
                   Decoding    Encoding    application 

SNR = 25           47%         20%         23%            10% 
SNR = 30           42%         18%         31%            9% 
Not condensed      15%         7%          75%            3% 
the time spent in control flow. In the blur transformation, 
less than half the time is spent in application of the con- 
densed operator. For shrink-by-2, only about one-quarter of 
the time is spent in operator application. The rest of the 
time is spent in overhead and in entropy coding operations. 
These results indicate that only limited performance gains 
are possible through further condensation. 
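
One way to quantify this limit is the standard Amdahl's law bound, sketched below with the operator application fractions from Tables 2 and 3. The sketch assumes that further condensation speeds up only the operator application phase and leaves Huffman decoding, Huffman encoding, and overhead unchanged. The names OPERATOR_FRACTION and max_speedup are ours, not part of the implementation.

```python
# Fraction of total run time spent in operator application,
# copied from Tables 2 and 3.
OPERATOR_FRACTION = {
    ("blur", "SNR = 25"): 0.39,
    ("blur", "SNR = 30"): 0.45,
    ("shrink-by-2", "SNR = 25"): 0.23,
    ("shrink-by-2", "SNR = 30"): 0.31,
}


def max_speedup(fraction, factor=float("inf")):
    """Overall speedup when only the operator phase runs `factor` times faster.

    This is Amdahl's law; it ASSUMES every other phase is unaffected.
    The default factor (infinity) gives the upper bound reached if the
    operator phase took no time at all.
    """
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    for (op, condition), fraction in OPERATOR_FRACTION.items():
        print(
            f"{op:12s} {condition:9s} "
            f"2x faster operator: {max_speedup(fraction, 2.0):.2f}x overall, "
            f"upper bound: {max_speedup(fraction):.2f}x"
        )
```

Under these assumptions, even a free operator phase would make blur at SNR = 30 less than twice as fast overall, and shrink-by-2 at SNR = 25 only about 1.3 times as fast.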



Applications 

Using the techniques developed in this paper, many impor- 
tant image and video processing problems can be computed 
in the compressed domain: 

(i) Geometric operations. Image warping that uses opera- 
tions such as translation, rotation, scaling, shearing, 
and other affine transformations can be expressed 
using tensors [3]. For example, in scaling a 640 × 480 
image to 320 × 240, each pixel in the output image is 
the average (i.e., a linear combination) of the corre- 
sponding pixels in the input image. 

(ii) Finite impulse response (FIR) filters used in signal 
processing can be expressed using tensors. Such oper- 
ations, which include smoothing, embossing effects, 
edge detection, and image enhancement, can be con- 
veniently represented using convolution [4]. The 
convolution function specifies the linear combination 
of pixels in the input image that are used to calculate a 
pixel in the output image. 

(iii) De-interlacing. In this operation, two images, called 
the odd and even fields, are combined to form a single 
frame. The fields contain sample data from every other 
line in the video source. A frame is formed by inter- 
leaving lines from two fields. To express this 
operation as a tensor, the two fields are first scaled by 
a factor of two in the vertical dimension, with the 
missing lines set to 0 (black). The two resulting frames 
can then be added together pixel-wise to create the 
de-interlaced frame. 

(iv) Sampling conversion. Video is often represented as a 
luminance and two chrominance channels. The lumi- 
nance channel is a gray-scale image, whereas the 
chrominance channels contain the extra information 
necessary to produce a color image. To reduce storage 
and bandwidth costs, chrominance channels are often 
sampled at a different resolution than the luminance 
channel. For example, MPEG [5] uses 4:2:0 sampling, 
where the luminance image is stored at 352 × 240 
resolution but the chrominance images are stored at 
176 × 120 resolution. Other standards use different 
sampling, such as 4:2:2. To transform 4:2:2 video to 
4:2:0 video, the chrominance images must be down- 
sampled, a process analogous to image scaling. 

(v) Morphing is a striking video effect where objects in 
two video sequences are deformed and the images are 
cross-dissolved to create the illusion that one object is 
transforming into another [6]. The effect is achieved 
by applying affine transformations to sections of each 
image pair, which can be expressed as a tensor; the 
pixels in the resulting images are then multiplied by a 
scalar constant and added together. 

(vi) Image composition. In video composition, multiple 
video outputs are combined to form a single video out- 
put (e.g., chroma key). Such a composition can be 
expressed using a combination of image translation 
and scaling on the input images, pixel-wise multiplica- 
tion with a mask image, and pixel-wise additions [7, 
8]. This type of video mixing has been proposed as the 
basis for next generation video conferencing systems 
that create a virtual conference table, where camera 
inputs from other conference participants are mixed to 
form a composite signal that displays the other atten- 
dees seated around a table [9]. 
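
To make the common thread of these examples concrete, the sketch below implements two of them in the spatial domain: de-interlacing as described in (iii), and a shrink-by-2 in the spirit of (i). It is illustrative only and operates on plain pixel arrays, not on compressed data. Two details are our assumptions: the even field is taken to hold frame lines 0, 2, 4, ..., and shrink-by-2 is taken to average each 2 × 2 block. All function names are hypothetical.

```python
def zero_fill_field(field, parity):
    """Scale a field by two vertically, leaving missing lines at 0 (black).

    parity 0 puts the field's rows on frame lines 0, 2, 4, ... and
    parity 1 puts them on lines 1, 3, 5, ... (an assumed convention).
    """
    width = len(field[0])
    frame = [[0] * width for _ in range(2 * len(field))]
    for row_index, row in enumerate(field):
        frame[2 * row_index + parity] = list(row)
    return frame


def add_pixelwise(a, b):
    """Pixel-wise sum of two images of equal size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def deinterlace(even_field, odd_field):
    """Interleave two fields: zero-fill each one, then add pixel-wise."""
    return add_pixelwise(zero_fill_field(even_field, 0),
                         zero_fill_field(odd_field, 1))


def shrink_by_2(image):
    """Each output pixel is the average of a 2 x 2 block (our assumption)."""
    return [
        [(image[y][x] + image[y][x + 1]
          + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
         for x in range(0, len(image[0]) - 1, 2)]
        for y in range(0, len(image) - 1, 2)
    ]


if __name__ == "__main__":
    even = [[1, 2], [5, 6]]
    odd = [[3, 4], [7, 8]]
    frame = deinterlace(even, odd)
    print(frame)
    print(shrink_by_2(frame))
```

Every output pixel in both functions is a fixed linear combination of input pixels, which is the property that lets such operations be written as tensors.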

In practice, the memory needed to store the compressed 
domain tensor t is prohibitively large unless the spa- 
tial domain operation exhibits enough symmetry that the 
block transforms can be shared. Of the examples above, 
only those that use image warping, such as morphing and 
general affine geometric transforms, are impractical for this 
reason. 
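
A rough back-of-the-envelope comparison shows the scale of the problem. The sketch below is only an order-of-magnitude illustration under assumptions of ours: coefficients are stored as 4-byte values, the unshared operator is fully dense, and, in the shared case, one output block depends on a 3 × 3 neighborhood of 8 × 8 input blocks. It is not a description of the storage layout used in our implementation.

```python
def dense_operator_bytes(width, height, bytes_per_coefficient=4):
    """Storage for a dense linear map from every input pixel to every output pixel."""
    pixels = width * height
    return pixels * pixels * bytes_per_coefficient


def shared_operator_bytes(block_pixels=64, neighborhood_blocks=9,
                          bytes_per_coefficient=4):
    """Storage for ONE block operator reused for every output block.

    ASSUMES each 8 x 8 output block (64 values) depends on
    `neighborhood_blocks` input blocks of 64 values each.
    """
    inputs = block_pixels * neighborhood_blocks
    return inputs * block_pixels * bytes_per_coefficient


if __name__ == "__main__":
    dense = dense_operator_bytes(640, 480)
    shared = shared_operator_bytes()
    print(f"dense, no sharing : {dense:,} bytes")
    print(f"one shared block  : {shared:,} bytes")
```

Under these assumptions the dense operator for a 640 × 480 image needs hundreds of gigabytes, while a single shared block operator needs on the order of a hundred kilobytes. Warping operations lack the symmetry that makes the second case possible.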



Related Work 

Image and video data processing in the spatial domain is a 
well-studied field. The work in this field can be divided into 
hardware designs for video processing [10-12], applica- 
tions that perform video processing off-line [13-15], and 
software techniques and algorithms for image processing 
[3, 4, 7, 16]. Less research has been done on processing 
video data in the compressed domain. Chitprasert and Rao 
developed a restricted form of the convolution theorem for 
the DCT similar to the DFT convolution theorem [17]. 
Chang and Messerschmitt developed a technique for com- 
positing motion compensated video in the compressed 
domain [8, 18]. Their work can be viewed as a special case 
of a translational operator coupled with factoring for 
improved efficiency. More recently, Natarajan and 
Bhaskaran [19] have found an efficient method for the 
special case of the shrink-by-2 operation in the compressed 
domain. Their method approximates the elements 
in the compressed domain operator using powers of two, 
allowing the result to be computed using only shifts and 
adds. Shen and Sethi have similarly examined inner block 
transforms, operations whose range is confined to a single 
8 × 8 block [20]. Arman has developed a technique to detect 
scene changes in motion-JPEG compressed video data in 
the DCT domain [21]. Seales has examined the problem of 
object recognition in the compressed domain [22]. 
Broadhead and Owen have extended these techniques to 
MPEG compressed audio data [23]. 

Many extensions to the work presented in this paper are 
possible. Condensation algorithms can be developed that 
improve the overall image quality using better metrics for 
finding insignificant elements than thresholding condensa- 
tion. For example, the first author's dissertation [1] 
explored an algorithm that bounds the average case error 
rather than the worst case error. Unfortunately, the results 
were no better than those obtained using thresholding con- 
densation. 

The technique of expressing compression and transfor- 
mation as linear operators, composing the operators, and 
condensing them to produce good approximations of the 
operator can be applied to a wide variety of other trans- 
form-based coding strategies (e.g., wavelet encoding). An 
interesting research question is how the use of other trans- 
forms will affect the trade-off between output quality and 
computation time. Careful study in this area might lead to 
good transcoding techniques that efficiently convert 
between different compressed representations. 

Finally, rather than asking 'how fast can I process data in 
this compressed format?' perhaps a better question is 'can a 
compression format be developed that allows fast process- 
ing?' If a compression format can be developed that allows 
rapid processing and transcoding to popular compression 
standards such as motion-JPEG or MPEG, it would make 
an excellent format for secondary storage on video servers 
with heterogeneous clients. Video servers could then store a 
single format and convert it to the appropriate client format 
in real time. 



Appendix A 

This appendix lists the experimentally determined values 
for maxK used in thresholding condensation in section 4. 
The data was gathered by examining 622 images from 14 
categories stored on an FTP archive [24], compressed using 
the default quantization tables presented in Annex K of the 
JPEG standard [25]. The value maxK[i] indicates the 
largest value of the SC-vector element F[i] seen in all 
622 images. Thus, maxK[i] represents a practical upper 
limit of the absolute value of F[i]. 